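Introduction to Triton Programming: The Performance Paradox: Why Correct Code Is Slow

The Performance Paradox states that a mathematically perfect kernel (e.g. $out = x + y$) can actually perform worse than a CPU loop if it fails to amortize the fixed costs of the GPU hardware. This often shows up as the launch tax.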

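1. The 'Correctness' Fallacy

Functional correctness is not a proxy for efficiency. Even if your Triton code correctly distributes work across thousands of threads, the GPU remains underutilized when the total workload (N) is small. In that case the hardware spends more time on state transitions than on actual arithmetic.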

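2. The Python Timing Trap

Benchmarking GPU code with time.time() in Python is risky. GPU calls are asynchronous: Python merely places the command in a queue and moves on to the next one. Without torch.cuda.synchronize(), you measure only the queuing time. With synchronization, the measurement for a small kernel is dominated by host-to-device latency, which is often 10x or more longer than the kernel's execution time.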
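The difference between timing the enqueue and timing the finished work can be sketched as follows. This is a minimal illustration, not a rigorous benchmark harness: `time_call` and its parameters are hypothetical names introduced here, and the PyTorch part runs only if PyTorch and a CUDA device happen to be available.

```python
import time


def time_call(fn, sync=None, warmup=3, repeats=20):
    """Return mean wall-clock seconds per call of `fn`.

    `sync` is an optional callable that blocks until all queued GPU work
    has finished (for example torch.cuda.synchronize).
    """
    for _ in range(warmup):
        fn()                      # warm-up calls are not timed
    if sync is not None:
        sync()                    # drain the queue before starting the clock
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if sync is not None:
        sync()                    # without this, only the enqueue cost is measured
    return (time.perf_counter() - start) / repeats


# Optional GPU demonstration; skipped when PyTorch or CUDA is missing.
try:
    import torch
    HAS_CUDA = torch.cuda.is_available()
except ImportError:
    HAS_CUDA = False

if HAS_CUDA:
    x = torch.randn(256, device="cuda")
    y = torch.randn(256, device="cuda")

    def add():
        return x + y

    queued = time_call(add)                                # enqueue time only
    synced = time_call(add, sync=torch.cuda.synchronize)   # includes GPU completion
    print(f"no sync: {queued * 1e6:.1f} us/call, with sync: {synced * 1e6:.1f} us/call")
```

Synchronizing once after the timed loop, rather than after every call, keeps the measurement from adding a blocking round trip to each iteration.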

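3. Latency vs. Throughput

To overcome this paradox, you must supply enough work to 'hide' the launch latency. This is the transition from a latency-bound regime (limited by the host-device bus) to a throughput-bound regime (limited by device memory bandwidth or compute capability).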
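A rough cost model makes this transition concrete. The sketch below treats the runtime of $out = x + y$ as a fixed launch overhead plus the time needed to move the data. The overhead and bandwidth figures are assumed values chosen only for illustration, not measurements of any particular GPU, and the function names are hypothetical.

```python
# Assumed, illustrative hardware numbers (not measured values).
LAUNCH_OVERHEAD_S = 5e-6       # assumed fixed cost per kernel launch, in seconds
MEM_BANDWIDTH_BPS = 1e12       # assumed device memory bandwidth, in bytes/second
BYTES_PER_ELEMENT = 3 * 4      # float32 out = x + y: 2 loads + 1 store per element


def estimated_time(n):
    """Estimated seconds for one vector-add launch over n elements."""
    return LAUNCH_OVERHEAD_S + n * BYTES_PER_ELEMENT / MEM_BANDWIDTH_BPS


def launch_fraction(n):
    """Share of the estimated runtime spent on the fixed launch overhead."""
    return LAUNCH_OVERHEAD_S / estimated_time(n)


def crossover_n():
    """Element count at which data movement time equals the launch overhead."""
    return LAUNCH_OVERHEAD_S * MEM_BANDWIDTH_BPS / BYTES_PER_ELEMENT


for n in (256, 10**6, 10**8):
    print(f"N={n:>11,d}  launch share = {launch_fraction(n):6.1%}")
print(f"crossover at roughly N = {crossover_n():,.0f} elements")
```

Under these assumptions, a launch well below the crossover size is latency-bound and one well above it is throughput-bound; real hardware will shift the crossover but not the shape of the curve.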

QUESTION 1

For each kernel, decide whether the bottleneck is likely arithmetic throughput, memory bandwidth, or launch overhead: Vector addition (N=256), Vector addition (N=10^8), and Matrix Multiplication (N=8192).

N=256: Arithmetic; N=10^8: Bandwidth; MM: Launch

N=256: Launch; N=10^8: Bandwidth; MM: Arithmetic

N=256: Bandwidth; N=10^8: Arithmetic; MM: Launch

All are compute-bound.

QUESTION 2

In the context of the Performance Paradox, what is the primary bottleneck for a 'ReLU on a matrix' operation?

Arithmetic Throughput

Memory Bandwidth

L1 Cache Size

QUESTION 3

What does the term 'Asynchronous Execution' imply regarding GPU benchmarking?

The GPU and CPU always finish at the same time.

The CPU continues to the next line of code before the GPU kernel finishes.

The kernel runs faster on smaller GPUs.

Memory transfers are blocked by compute.

QUESTION 4

Why does $out = x + y$ exhibit low arithmetic intensity?

It uses three memory accesses (2 loads, 1 store) for a single floating-point operation.

The addition operation is too complex for the ALUs.

It requires shared memory synchronization.

It only runs on one SM.

QUESTION 5

How can the 'Launch Tax' be amortized in a real-world application?

By calling the kernel more frequently with smaller data.

By increasing the workload per launch (e.g., larger N or batching).

By using 16-bit floats instead of 32-bit floats.

By disabling the L2 cache.